A chromosome-level genome assembly for the eastern fence lizard (Sceloporus undulatus), a reptile model for physiological and evolutionary ecology

Abstract Background High-quality genomic resources facilitate investigations into behavioral ecology, morphological and physiological adaptations, and the evolution of genomic architecture. Lizards in the genus Sceloporus have a long history as important ecological, evolutionary, and physiological models, making them a valuable target for the development of genomic resources. Findings We present a high-quality chromosome-level reference genome assembly, SceUnd1.0 (using 10X Genomics Chromium, HiC, and Pacific Biosciences data), and tissue/developmental stage transcriptomes for the eastern fence lizard, Sceloporus undulatus. We performed synteny analysis with other snake and lizard assemblies to identify broad patterns of chromosome evolution including the fusion of micro- and macrochromosomes. We also used this new assembly to provide improved reference-based genome assemblies for 34 additional Sceloporus species. Finally, we used RNAseq and whole-genome resequencing data to compare 3 assemblies, each representing an increased level of cost and effort: Supernova Assembly with data from 10X Genomics Chromium, HiRise Assembly that added data from HiC, and PBJelly Assembly that added data from Pacific Biosciences sequencing. We found that the Supernova Assembly contained the full genome and was a suitable reference for RNAseq and single-nucleotide polymorphism calling, but the chromosome-level scaffolds provided by the addition of HiC data allowed synteny and whole-genome association mapping analyses. The subsequent addition of PacBio data doubled the contig N50 but provided negligible gains in scaffold length. Conclusions These new genomic resources provide valuable tools for advanced molecular analysis of an organism that has become a model in physiology and evolutionary ecology.


Context
Genomic resources, including high-quality reference genomes and transcriptomes, facilitate comparisons across populations and species to address questions ranging from broad-scale chromosome evolution to the genetic basis of key adaptations.Squamate reptiles, the group encompassing lizards and snakes, have served as important models in ecological and evolutionary physiology owing to their extensive metabolic plasticity [1]; diverse reproductive modes including obligate and facultative parthenogenesis [2]; repeated evolution of placental-like structures [2,3]; shifts among sex-determining systems, with XY, ZW, and temperature-dependent systems represented often in closely related species [4,5]; loss of limbs and elongated body forms [6]; and the ability to regenerate tissue [7,8].
Despite having evolved greater phylogenetic diversity than mammals and birds, 2 major vertebrate groups with extensive genome sampling, genomic resources for squamates remain scarce and assemblies at the chromosome level are even more rare [7,[9][10][11][12][13].While squamates are known to have a level of karyotypic variability similar to that of mammals [14], the absence of high-quality genome assemblies has led to their exclusion from many chromosome-level comparative genome analyses.In comparative studies, non-mammalian amniotes are often represented only by the chicken, which is divergent from squamate reptiles by almost 280 million years [15], or the green anole (Anolis carolinensis), whose genome is only 60% assembled into chromosomes and is lacking assembled microchromosomes [14,16].However, recent analyses have identified key differences that distinguish the evolution of squamate genomes from patterns found in mammals and birds [17], underscoring the need for additional high-quality genome assemblies for lizards and snakes.The development of additional squamate genomes within and across lineages will facilitate investigations of the genetic basis for many behavioral, morphological, and physiological adaptations in comparisons of organisms from the population up to higher-order taxonomic ranks.
Our goal was to develop high-quality genomic and transcriptomic resources for the spiny lizards (Sceloporus) to further our ability to address fundamental ecological and evolutionary questions within this taxon, across reptiles, and across vertebrates.The genus Sceloporus includes ∼100 species extending throughout Central America, Mexico, and the United States [18].Researchers have used Sceloporus for decades as a model system in the study of physiology [19,20], ecology [21,22], repro-ductive ecology [23][24][25], life history [26][27][28], and evolution [25,[29][30][31].The long history of research on Sceloporus species, applicability across multiple fields of biology, and the extensive diversity of the genus make this an ideal group to target for genomic resource development.
We focus on the eastern fence lizard, Sceloporus undulatus (NCBI:txid8520), which is distributed in forested habitats east of the Mississippi River [32].Recently, S. undulatus has been the focus of studies on the development of sexual size dimorphism [33,34], as well as experiments testing the effects of invasive species [35][36][37] and climate change [22,[38][39][40] on survival and reproduction as a model to better understand the consequences of increasing anthropogenic disturbance.The development of genomic resources for S. undulatus, particularly a high-quality genome assembly, will support its role as a model species for evolutionary and ecological physiology and will have immediate benefits for a broad range of comparative studies in physiology, ecology, and evolution.
To this end, we developed a high-quality chromosome-level reference genome assembly and transcriptomes from multiple tissues for S. undulatus.We apply this genome reference to datasets on 3 scales: (i) to address how assembly quality influences mapping of RNA sequencing (RNAseq) and lowcoverage whole-genome sequencing (WGS) data, (ii) to improve upon the genomic resources for the Sceloporus genus by creating reference-based assemblies of draft genomes for 34 other Sceloporus species, and (iii) to draw broad comparisons in chromosome structure and conservation with other recently published squamate chromosome-level genomes through large-scale synteny analysis.

Sequencing and assembly of the Sceloporus undulatus genome
Genome sequence data were generated from 2 male S. undulatus collected at Solon Dixon Forestry Education Center, in Andalusia, AL (31 09.49N, 86 42.10 W).The animals were euthanized and tissues were dissected, snap-frozen in liquid nitrogen, and stored at −80 We developed 3 S. undulatus genome assemblies using increasingly more data with correspondingly greater cost: (i) a SuperNova assembly containing data from 10X Genomics Chromium; (ii) a HiRise assembly containing the 10X Genomics data with the addition of Hi-C data; and (iii) a PBJelly Assembly containing the 10X Genomics data and Hi-C data, with the addition of Pacific Biosciences (PacBio) data.These assemblies are provided as supplemental files and their summary statistics are provided in Table 1.
In the fall of 2016, we sequenced DNA from snap-frozen brain tissue of a single juvenile male S. undulatus using 10X Genomics Chromium Genome Solution Library Preparation with Super-Nova Assembly [41] through HudsonAlpha.The library was sequenced on 1 lane of Illumina HiSeqX (Illumina HiSeq X Ten, RRID:SCR 016385), resulting in 774 million 150-bp paired-end reads that were assembled using the SuperNova pipeline.We refer to this assembly with 46× coverage as the SuperNova Assembly.
In the fall of 2017, we sequenced a second male (Fig. 1) from the same population using a Hi-C library with Illumina sequencing through Dovetail Genomics prepared from blood, liver, and muscle tissue.We used this second individual because the remains from the individual used for SuperNova Assembly were insufficient for Hi-C library preparation, which required 100 mg of tissue.Dovetail Genomics developed 2 Hi-C libraries that were sequenced on an Illumina HiSeqX to produce 293 million and 289 million (total 582 million) 150-bp paired-end reads.The data from both Hi-C and 10X Genomics were used for assembly via the HiRise software (v2.1.3-5ce4af34ac25)pipeline at DoveTail Genomics [42,43].This pipeline excludes contigs/scaffolds <1 kb and only uses MQ >50 reads for scaffolding, and the model fitting step uses a 10 Mb maximum.The reads were aligned with a modified SNAP pipeline.We refer to this assembly with 4,859× coverage as the HiRise Assembly.
Finally, also in fall of 2017, DNA extracted from the second adult male was used by Dovetail Genomics to generate 1,415,213 PacBio reads with a mean size of 12,418.8bp (range, 50-82,539 bp).These PacBio data were used for gap filling to further improve the lengths of the scaffolds of the HiRise Assembly using the program PBJelly (PBJelly, RRID:SCR 012091) [44], with the following parameters: -minMatch 8 -sdpTupleSize 8 -minPctIdentity 75 -bestn 1 -nCandidates 10 -maxScore -500nproc 36 -noSplitSubreads.We refer to this final assembly containing all 3 types of sequencing data as the PBJelly Assembly and the final SceUnd1.0 reference genome assembly.
For a visual comparison of our 3 S. undulatus assemblies and other squamate genomes, we graphed genome contiguity for these 3 assemblies with other squamate reptile genomes, building on the graph by Roscito et al. [45].The S. undulatus SuperNova Assembly (containing only the 10X Genomics data) is as contiguous as the bearded dragon genome assembly (Fig. 2a).The addition of the HiRise data brought a large increase in continuity.The HiRise and PBJelly S. undulatus assemblies are nearly indistinguishable from each other and are among the most contiguous squamate genome assemblies to date (Fig. 2a).
The SceUnd1.0 assembly contains 45,024 scaffolds (>850 bp, without gaps) containing 1.9 Gb of sequence, with an N50 of 275 Mb.Importantly, 92.6% (1.765 Gb) of the assembled sequence is contained within the first 11 scaffolds.Chromosomal studies have determined that the S. undulatus karyotype is 2N = 22 with a haploid genome of N = 11 (6 macrochromosomes + 5 microchromosomes) [31,46].Sorting the top 11 scaffolds by size (Fig. 2b) suggests that scaffolds 1-6 are the macrochromosomes (170-383 Mb in size) and scaffolds 7-11 are the 5 microchromosomes  Mb in size) (Fig. 2b).These results suggest that the first 11 scaffolds represent the 11 chromosomes, although the assembly also produces 45,000 smaller scaffolds between 0.85 kb and 7 Mb that may still contain relevant chromosomal segments that could not be assembled.Estimated genome size of the closely related species Sceloporus occidentalis is 2.36 Gb on the basis of Feulgen densiometry [14].Assuming that S. undulatus is similar, the 1.9 Gb of sequence in our SceUnd1.0assembly is likely either missing some data, or repeat regions have been condensed, representing redundancies.To assess the level of contamination in our SceUnd1.0genome assembly, we used Blobtools v1 (Blobtools, RRID:SCR 017618) [47] workflow A to estimate contamination based on GC content differences that exist between taxa.To visualize depth by GC content for taxa represented in the assembly, we created a blobDB using a BAM file to infer coverage and sequence similarity hits based on the DIA-MOND blast and the SceUnd1.0 assembly fasta file.Plots were produced for 2 taxonomic ranks, phylum and order, with taxonomic annotation based on the "bestsum" taxrule.The majority of the represented taxa in the assembly were annotated as belonging to Chordata (phylum level) and Squamata (order level).There is a smaller, but visible, proportion of reads that are associated with order Testudines, which is likely due to regions of sequence similarity across reptiles.Overall, the plot demonstrates negligible contamination of other taxa (Supplementary Fig. S1).
To assess the completeness of our 3 genome assemblies, we used the BUSCO (BUSCO, RRID:SCR 015008) Tetrapoda dataset (3,950 genes) [48,49].For all 3 assemblies we found >89% of BUSCO genes complete (Table 1) with only minor differences in BUSCO genes between the SuperNova, HiRise, and PBJelly Assemblies (89.5%, 90.2%, 90.9% complete).This suggests that the initial SuperNova Assembly captured nearly all the genomic content despite having considerably shorter scaffolds (Table 1).The small increase in success with the more contiguous assemblies seems to result from a reduction in fragmented BUSCO genes with increasing data.In the SuperNova Assembly, 6.4% of BUSCO genes were present as fragments whereas only 5.5% and 5.0% were present as fragments in the HiRise and PBJelly Assemblies, respectively, thus explaining the 1.4% difference in complete BUSCO genes present.Interestingly, there was a 0.2% (i.e., 8 genes) increase in missing BUSCO genes from the Su-perNova to the HiRise Assembly.In the PBJelly Assembly (Sce-Und1.0),the BUSCO genes are almost all found on the largest 11 scaffolds (Fig. 2c), as we would predict if those scaffolds corresponded to chromosomes.Most of the BUSCO genes on the smaller scaffolds were duplicated.Even so, there are a small number of complete and fragmented BUSCO genes present on a handful of the tiny scaffolds (Fig. 2c), suggesting that these scaffolds contain pieces of the chromosomes that were not properly assembled.

De novo assembly and annotation of the Sceloporus undulatus transcriptome
Samples used for the de novo transcriptome were obtained from 3 gravid female S. undulatus collected in Edgefield County, SC (33.7 • N, 82.0 • W), and transported to Arizona State University.These animals were maintained under conditions described in previous publications [50,51], which were approved by the Institutional Animal Care and Use Committee (Protocol No. 14-1338R) at Arizona State University.Approximately 2 days after laying eggs, each lizard was killed by injecting sodium pentobarbital into the coelomic cavity.Whole-brain and skeletal muscle samples were removed and placed in RNA-lysis buffer (mir-Vana miRNA Isolation Kit, Ambion) and flash-frozen.Additionally, 3 early-stage embryos from each clutch were dissected, The contig or scaffold length such that the sum of the lengths of all scaffolds of this size or larger is equal to 50% (90%) of the total assembly length; L50 (L90): The smallest number of scaffolds that make up 50% (90%) of the total assembly length.
pooled together, homogenized in RNA-lysis buffer, and also flash-frozen.
Total RNA was isolated from the embryo and 3 tissue samples from each adult female (whole brain, skeletal muscle) using the mirVana miRNA Isolation Kit (Ambion) total RNA protocol.Samples were checked for quality on a 2100 Bioanalyzer (Agilent).One sample from each tissue was selected for RNAseq based on the highest RNA Integrity Number (RIN), with a minimum cutoff of 8.0.For each selected sample, 3 μg of total RNA was sent to the University of Arizona Genetics Core (Tucson, AZ) for library preparation with TruSeq v3 chemistry for a standard insert size.RNA samples were multiplexed and sequenced using an Illumina HiSeq 2000 (Illumina HiSeq 2000, RRID:SCR 020132) to generate 100-bp paired-end reads.Publicly available raw Illumina RNAseq reads from S. undulatus liver (juvenile male) were also added to our dataset [52,53].After removing adapters, raw reads from the 4 tissues were evaluated using FastQC [54] and trimmed using Trimmomatic v-0.32 [55], filtering for quality score (≥Q20) and using HEADCROP:9 to minimize nucleotide bias.This procedure yielded 179,374,469 quality-filtered reads.Table 2 summarizes read-pair counts from whole brain, skeletal muscle, whole embryos, and liver.
All trimmed reads were pooled and assembled de novo using Trinity v-2.2.0 with default k-mer size of 25 [56].From the final transcriptome, a subset of contigs containing the longest open reading frames (ORFs), representing 123,323 transcripts, was extracted from the de novo transcriptome assembly using TransDecoder v-3.0.0 (TransDecoder, RRID:SCR 017647) [57] with homology searches against the databases UniProtKB/SwissProt [58] and PFAM [59].The transcriptome was annotated using Trinotate v-3.0 (Trinotate, RRID:SCR 018930) [60], which involved searching against multiple databases (as UniProtKB/SwissProt, PFAM, signalP, GO) to identify sequence homology and protein domains, as well as to predict signaling peptides.This pooled Tissue-Embryo Transcriptome and annotation are provided as supplemental files.
The most comprehensive transcriptome, obtained using reads from 4 tissues, consists of 547,370 contigs with a mean length of 781.5 nucleotides (Table 2)-shorter than other assemblies because of the range of contig sizes that varied among datasets (1, 3, and 4 tissues; Supplementary Table S1, Fig. S2).The N50 of the most highly expressed transcripts that represent 90% of the total normalized expression data (E90N50) was lowest in the assembly based on 1 tissue (Table 2).To validate the de novo transcriptome data, trimmed reads from the 4 tissues used for RNA sequencing (brain, skeletal muscle, liver, and whole embryos) were aligned back to the Trinity-assembled contigs using Bowtie2 v2.2.6 (Bowtie2, RRID:SCR 016368) [61].From the 176,086,787 reads that aligned, 97% represented proper pairs (Supplementary Table S2), indicating good read representation in the de novo transcriptome assembly.To assess quality and completeness of the assemblies, we first compared the de novo assembled transcripts with the BUSCO Tetrapoda dataset, with BLAST+ v2.2.31 [62] and HMMER v3.1b2 (HMMER, RRID:SCR 005 305) [63] as dependencies.This procedure revealed that the de novo transcriptome assembly captured 97.1% of the expected orthologues (sum of completed and fragmented), a result comparable to the 97.8% obtained for the green anole transcriptome using 14 tissues [64] (Table 3).Next, nucleotide sequences of de novo assembled transcripts with the longest ORFs were compared to the protein set of A. carolinensis (AnoCar2.0,Ensembl) using BLASTX (BLASTX, RRID:SCR 001653) (e-value = 1e−20, max target seqs = 1).This comparison showed that 11,223 transcripts of S. undulatus have nearly full-length (>80%) alignment coverage with A. carolinensis proteins (Supplementary Table S3).Predicted proteins of S. undulatus were also used to identify 13,422 one-to-one orthologs with proteins of A. carolinensis through reciprocal BLAST (e-value = 1e−6, max target seqs =  1).Table 4 summarizes the de novo transcriptome annotation results.

Genome assembly annotation
Using the 24 largest scaffolds of the SceUnd1.0 assembly (we refer to this set as SceUnd1.0 top24), we used the Funannotate v1.5.0 pipeline [65] for gene prediction and functional annotation.Funannotate uses RNAseq data and the Tetrapoda BUSCO [48] dataset to train the ab initio gene prediction programs Augustus [66] and GeneMark-ET [67].Evidence Modeler is used to generate the consensus from Augustus and GeneMark-ES/ET.In the training step, we used 4 raw RNAseq datasets described in Table 5 that contained a total of 68 sequenced libraries.tRNAscan-SE (tRNAscan-SE, RRID:SCR 01083 5) [68] was used to predict transfer RNA (tRNA) genes.Finally the genes were functionally annotated via InterProScan (In-terProScan, RRID:SCR 005829) [69], eggNOG (eggNOG, RRID:SC R 002456) [70], Pfam (Pfam, RRID:SCR 004726) [59], UniProtKB [58], MEROPS (MEROPS, RRID:SCR 007777) [71], CAZyme, and GO ontology.We also used DIAMOND blastp [72] to compare the predicted proteins to ENSEMBL human, chicken, mouse, and green anole lizard databases (Data archived files: Sce-Und1.0top24.gff3;SceUnd1.0 top24 CompiledAnnotation.csv).Our annotation pipeline predicted 54,149 genes, 15,472 of which were attributed meaningful functional annotation beyond "hy-pothetical protein."Through BLAST of the predicted proteincoding genes, we found 21,050 (39%) had hits in ENSEMBL.We then quantified the number of BUSCO genes identified in the predicted proteins from the Funannotate pipeline and found 79.1%, which corresponds to an 11.6% decrease from the number of complete BUSCO genes in the SceUnd1.0 genome assembly.Because there were more BUSCOs fragmented or missing from the predicted proteins (the annotation) than the actual genomic sequence itself, we attribute those to annotation errors, not errors in the assembly, which suggests that this first version of annotation can be improved.SceUnd1.1 (a slightly updated version of SceUnd1.0 based on NCBI requirements) was submitted to NCBI for annotation.The SceUnd1.1 is version JAGXEY010000000, GenBank accession GCA 019175285.1.This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JAGXEY000000000.We used annotation and sequence homology to identify the X chromosome.Sex chromosomes are highly variable among Sceloporus species, and the genus seems to have evolved multiple variations of XY systems [31].However, some species, including S. undulatus, do not seem to have morphologically distinct sex chromosomes [73].While the ancestral condition is heteromorphic chromosomes with a minute Y, many species within the genus demonstrate multiple sex chromosome heteromorphisms (i.e., multiple forms of the X chromosome) or have evolved indistinct sex chromosomes, such as the undulatus species group [18].These heteromorphisms are likely the result of other chromosomes' fusions to the X, as Sceloporus are among the large portion of iguanian lizards with conserved sex chromosomes, and another Sceloporus species within the same broad 2n = 22 radiation, Sceloporus malachiticus, has an X chromosome homologous to the green anole X but fused to several microchromosomes [74].Given this observed homology, we used known X chromosome genes from the green anole to identify the scaffold likely representing the X chromosome within S. undulatus.We blasted 16 X-linked genes from the green anole downloaded from Ensembl (AnoCar2.0:ACAD10, ADORA2A, ATP2A2, CCDC92, CIT, CLIP1, CUX2, DGCR8, FICD, MLEC, MLXIP, ORAI1, PLBD2, PUS1, TMEM119, ZCCHC8) [75,76] to SceUnd1.0.They almost exclusively map to the tenth largest scaffold (Fig. 2b), indicating that it is likely the X chromosome.The Y chromosome could not be independently identified from the assembly, most likely owing to the homomorphic nature of S. undulatus sex chromosomes; higher sequence homology may have caused the Y chromosome to assemble with the X chromosome [31].This result, that the fourth predicted microchromosome is the putative X chromosome, is further supported by a separate synteny analysis described below.

Repeat annotation and evolutionary analysis
To estimate the repetitive landscape of the S. undulatus genome, we modeled repeats de novo by running RepeatModeler v1.0.8 (RepeatModeler, RRID:SCR 015027) [77] on the SceUnd1.0 assembly.We then annotated repeats in the assembly using Repeat-Masker v4.0.7 (RepeatMasker, RRID:SCR 012954) [78] with the de novo consensus repeat library.To estimate evolutionary diver-gence within repeat families in the S. undulatus genome, we generated repeat-family-specific alignments and calculated the average Kimura-2-parameter divergence from consensus within each family, correcting for high mutation rates at CpG sites with the calcDivergenceFromAlign.pl RepeatMasker tool.We compared the divergence profiles of S. undulatus and A. carolinensis by completing parallel analyses.We annotated repeats in the A. carolinensis genome (AnoCar2.0)with RepeatMasker and the "anolis" repeat library from RepBase release 20170127 [79].
The S. undulatus assembly contained a diverse repertoire of repeats including transposable elements, the most abundant of which are the long interspersed nuclear elements (LINEs, Supplementary Table S4) comprising ∼15% of the genome.Relative proportions of LINEs, short interspersed nuclear repeats (SINEs), long terminal repeat (LTR) retrotransposons, and DNA transposons were similar to those of A. carolinensis.The diversity of repeat elements in S. undulatus mirrors that of the Anolis genome [80], as well as that of other squamates [17].However, the age distribution of elements between the 2 genomes was vastly different (Fig. 3).For instance, a much larger proportion of the Anolis genome was comprised of transposable element insertions ≤10% from their family consensus.This indicates an overabundance of inserts resulting from recent activity in A. carolinensis relative to Sceloporus.In particular, the Anolis genome contained far more recent SINEs (Kruskal-Wallis test; P = 9.374e−05).The distribution of recent LINEs was significantly different between the 2 genomes (P = 2.824e−06), and Anolis contained more recent insertions from the L1 family (P = 0.0001571), as well as RTE-BovB (P = 0.001152) and R4 (P = 0.0001571).The Anolis genome also contained more recent LTR retrotransposons (P = 1.153−e07), as well as Mariner (P = 0.0002122), Tigger (P = 0.01017), and Chapaev (P = 0.001152) DNA transposons.

Mitochondrial genome assembly
The mitochondrial genome was not captured by the genome sequencing approaches, likely owing to how these types of libraries are prepared.However, mitochondrial sequence data obtained via RNAseq can be effectively assembled into whole mitochondrial DNA (mtDNA) genomes [81][82][83][84].We used RNAseq reads from 18 S. undulatus individuals from the RNAseq Dataset 4 (Table 5), which are from the same population as the individuals used for the genome sequencing.We used Trimmomatic v0.37 (Trimmomatic, RRID:SCR 011848) [55] to clean the raw reads and then mapped the clean reads to a complete S. occidentalis mtDNA genome [85] using BWA v0.7.15 (BWA, RRID:SCR 010910) [86].Of the 632,987,330 total cleaned reads, 9.73% mapped to the S. occidentalis mtDNA genome with an average read depth of 5,164.42reads per site per individual.After sorting and indexing mapped reads with SAMTOOLS v1.6 (SAMTOOLS, RRID:SCR 002105) [87], we used the mpileup function in SAMTOOLS to build a consensus mitochondrial genome (mtGenome) excluding the reference and filling the no-coverage regions with "N" to generate 100% coverage of the mtGenome based on the consensus across the 18 individuals.We mapped the consensus genome to the well-annotated A. carolinensis mtGenome with MAFFT v1.3.7 (MAFFT, RRID:SCR 011811) [88] and transferred the annotation using the "copy annotation" command in GENEIOUS v.11.1.5(GENEIOUS, RRID:SCR 010519) [89].Annotations from the A. carolinensis mtGenome (17,223 bp) transferred well to the newly assembled S. undulatus mtGenome (17,072 bp), with 13 proteincoding genes, 22 tRNA regions, 2 ribosomal RNA regions, and a control region (see full list in Supplemental Results File).While this genome is useful for understanding sequence variation and comparative genomics and phylogenetic analyses, this mitochondrial genome should not be used for examination of mitochondrial genome structure.The mitochondrial genome and the annotation are provided as supplemental data.

Addressing reference assembly quality using population-level transcriptomic and genomic data
In developing the high-quality reference genome for S. undulatus, we produced 3 assemblies using increasing amounts of data, for correspondingly greater costs.To assess the utility of each of the assemblies for addressing ecological genomic questions, we use 2 datasets: RNAseq and whole-genome resequencing.First, we used RNAseq Dataset 4 (Table 5) from n = 18 males that were sampled from the same population (Alabama) as the individuals that were used to develop the reference assemblies; we then used these data to test whether the percentage of reads that mapped to the reference varied depending on which assembly we used as the reference.RNAseq data were cleaned with Trimmomatic v0.37 [55] and mapped with HISAT2 v2.1.0[90] to each of the 3 S. undulatus genome assemblies.The percentages of reads that mapped were calculated using SAMTOOLS v1.6 flagstat [87].We found negligible differences in mapping the RNAseq data to the SuperNova, HiRise, and PBJelly assemblies, where 81.49%, 82.37%, and 82.28% of cleaned reads mapped, respectively (Table 6).
Second, we prepared genomic DNA libraries for massively parallel sequencing for n = 10 S. undulatus individuals (6 females, 4 males) from the same Alabama population as the individuals that were used to develop the reference assemblies.We also prepared libraries for n = 5 S. undulatus individuals (1 female, 4 males) from Edgar Evins, TN, and for n = 5 individuals (2 females, 3 males) from St. Francis, AR.This Arkansas population is at the borders of the S. undulatus and Sceloporus consobrinus geographic distributions, making its taxonomic status uncertain [18].Specifically, we followed standard protocols for tissue DNA extraction from toe and/or tail clips with OMEGA EZNA Tissue spin-column kits.We then prepared sequencing libraries using the Illumina TruSeq Nano kit.We multiplexed these libraries with other individuals not included in this analysis and sequenced the library pool across 2 Illumina NovaSeq 6000 S4 sequencing runs.Five individuals from each of the 3 populations were sequenced to ∼20× average read coverage; the remaining 5 individuals from Alabama were sequenced to lower coverage (∼3×).Raw sequence read data were trimmed with Trimmomatic [55] and mapped separately to each of the 3 S. undulatus assemblies with bwa mem [86].SAMTOOLS flagstat [87] was used to calculate the total number of alignments in the .samfiles generated during mapping and the number of shotgun reads that mapped to each assembly.The CollectWgsMetrics tool from the Picard Toolkit [91] was used to calculate genome-wide coverage of the mapped reads for each individual and assembly, and theoretical HET SNP sensitivity (a metric based on coverage and base-quality distribution that estimates the probability of calling a true heterozygote SNP) as a way to predict the utility of each assembly as a reference for calling SNPs at high and low coverage.For all sequencing depths and populations, we observed fewer total alignments to the PBJelly Assembly than to either the HiRise or Supernova Assemblies (Table 6).Even though there were <0.5% fewer total reads that passed quality control (QC) with the PBJelly Assembly/ SceUnd1.0, a higher percentage Datasets were mapped to either the SuperNova Assembly containing only the 10X Genomics Chromium data, the HiRise Assembly containing 10X Genomics Chromium and Hi-C data, or the PBJelly assembly (SceUnd1.0)containing 10X Genomics Chromium, Hi-C, and PacBio data.Mean SAMTOOLS QC-passed reads, reads mapped, and percentage of mapped QC-passed reads for every sequencing depth and population are shown along with mean whole-genome coverage and theoretical HET SNP sensitivity for every assembly and population.Data are available in NCBI BioProject: PRJNA656311.
of the QC-passed reads mapped to this assembly than to either the HiRise or Supernova Assemblies (Table 6).We also determined that individuals from the same population as the S. undulatus individuals used to create these reference assemblies had a higher percentage of reads map to the assemblies than individuals from the Tennessee or Arkansas populations (Table 6).Those reads had lower whole-genome coverage and lower theoretical HET SNP sensitivity when mapped to the PBJelly/SceUnd1.0Assembly than either the HiRise or Supernova Assemblies (Table 6).This may be due to repetitive regions being added to the assembly by the PacBio data, making it slightly less mappable.Both the RNAseq and the whole-genome resequencing datasets support the conclusion that the 10X Chromium data that were used for the SuperNova Assembly covered the genome sufficiently to be a good reference for mapping RNAseq and WGS data and that the HiC data (included in the HiRise Assembly) and the PacBio data (included in the final PBJelly Assembly) did not increase the amount of sequence information.Rather, the use of the HiC data and PacBio data resulted in larger scaffolds, which will aid in understanding the genomic context of expression data and sequence variants.

Assembly and refinement of genomic data for 34 additional Sceloporus species
Draft reduced-representation genomes are available for 34 species within Sceloporus [92,93] (phylogeny in Fig. 4a).We downloaded the raw genomic reads for these 34 Sceloporus species from the SRA (Study Accession SRP041983; Table 7).Genomic resources for 33 of the species were obtained using reducedrepresentation libraries (yielding ∼5 Gb per species), while 1 species, S. occidentalis, was sequenced using whole-genome shotgun sequencing (40.88Gb; Table 7) [92].To improve the draft assemblies for these 34 species, we mapped these raw reads to the final assembly, SceUnd1.0, using BWA-MEM [94].Only the 11 longest, putative chromosome scaffolds from the SceUnd1.0 were used.The GATK version 3 [95][96][97] RealignerTargetCreator and IndelRealigner tools were used for local realignment, and HaplotypeCaller was used to identify insertion/deletion (INDEL) and single nucleotide polymorphism (SNP) variants.These sequence variants were separated and filtered with the SelectVariants and VariantFiltration tools using the GATK base settings.BEDTools [98] "genomecov" tool was used to calculate coverage and identify regions with no coverage.We generated consensus sequences for each species by writing variants back over the reference fasta and replacing nucleotides with no coverage with "N", using BCFtools [87] "consensus" for SNPs and BEDTools "maskfasta" for indels and regions with no mapping coverage (Supplemental Code File).
Mapping the reduced representation genome data from the 33 additional Sceloporus species improved the assemblies for each species.It seems there was a considerable amount of bycatch in many of the reduced-representation sequences that is normally filtered out when those reduced representation data are analyzed.For the species with ∼5 Gb of sequencing data, we improved the genome coverage from a mean of 1.23% to a mean of 44.4% coverage at low depth (1-3×) (Supplementary Fig. S3).For S. occidentalis with ∼41 Gb of data, coverage improved from 61.0% to 88.7% (Table 7), at an average ∼20× depth (Supplementary Fig. S3).Across the 33 species with ∼5 Gb of data, the BUSCO genes identified (complete and fragmented) in the reference-based assemblies ranged from 0.5% to 71.9% (complete and fragmented), whereas S. occidentalis had 95.9% BUSCO genes (complete and fragmented) identified, similar to our S. undulatus SuperNova Assembly (Table 7).Notably, across the Sceloporus genus, the percentage of the raw data that mapped to the reference was negatively correlated with divergence time to the reference, S. undulatus (P < 0.0001, r = 0.779; Fig. 4b).For species that are less than ∼20 million years diverged from S. undulatus, >90% of reads mapped; the percentage of reads mapped declined to 75% when divergence was >35 million years (Fig. 4b).Genomic resources for 34 of the species were obtained using reduced representation libraries [93], while 1 species, S. occidentalis, was sequenced using whole-genome shotgun sequencing [92].The data were downloaded from the SRA (Study Accession SRP041983 [93]).Gigabases refer to the amount of sequence data for each library.The color of the dots represents the percent of the genome that is covered, which was affected by the number of redundant sequences in the reduced representation library for a particular species.
It is important to note that the reference-based assemblies produced for these 34 species will correspond 1:1 with the synteny of the S. undulatus scaffolds.However, Sceloporus is notable among squamates for remarkable chromosome rearrangements with karyotypes ranging from 2N = 22 to 2N = 46 [31].Therefore, the genome assemblies for species with karyotypes other than 2N = 22 (the S. undulatus reference) or with large chromosomal inversions will not be reliable for addressing questions related to genomic architecture or structural variation [100].However, these draft genomes contain a substantial amount of data that can be used for comparative genomic analyses.Supplementary Fig. demonstrates the overlap in coverage of SceUnd1.0 by the reference-based genome assemblies.These distributions estimate that 50% of the genome would be covered by a subset of 16 species.Focusing on 1 gene of interest to our group, IGF1, we found that 16 of the 34 species had >75% coverage across the protein-coding region of this gene and 24 of them had >50% coverage (Supplementary Fig. S4).Thereby, this dataset should prove useful for analyses of protein and gene sequence evolution to understand behavioral ecology, physiology, developmental biology, and more.

Analysis of synteny with other squamate chromosome-level genomes
As another benchmark of genome completeness, and to generate an initial look at chromosome evolution among squamates, we performed synteny analysis of the eastern fence lizard (S. undulatus) SceUnd1.0 assembly with the green anole (A.carolinensis, AnoCar2.0) and with recently published chromosomelevel assemblies for the Burmese python (Python bivittatus) [101] and the Argentine black and white tegu lizard (Salvator merianae) [45] (available at NCBI BioProject: PRJNA473319).The SceUnd1.0 scaffolds representing the 11 putative chromosomes were each divided into 1,000-bp-long sequences that excluded gapped regions to serve as markers.Using BLAST, these markers were compared to the predicted chromosomes from the python and tegu HiC assemblies.BLAST hits for each were filtered to only include unique hits that had >80% identity and were ≥500 bp long and part of 4 consecutive hits from the same eastern fence lizard chromosome, a method previously used for synteny analysis for the prairie rattlesnake [102].Using these results, the eastern fence lizard chromosomes were painted onto the anole, python, and tegu chromosomes to visualize large-scale synteny (Fig. 5).
The decreased chromosome number in the S. undulatus species group compared to other Sceloporus lineages and the Iguanian group has long driven a hypothesis that a high number of fusions occurred in chromosomes in this species group, which is evident in the marker-based synteny painting of the S. undulatus genome.While the incomplete nature of the green anole genome, especially the lack of microchromosomes, makes many Sceloporus lineage-specific fusions difficult to identify, the inclusion of the tegu and python genomes provides guidance.For example, tegu chromosomes 6, 7, 9, and 12 are all syntenic to fence lizard chromosome 6.However, the tegu chromosomes 6 and 7 occur in a single block as the python X chromosome, and we cannot discern whether this was a fusion in a lineage preceding Iguanians and snakes or a fission in the tegu.The tegu chromosomes 9 and 12 are syntenic to python chromosomes 8 and 12, which may have been fused in the fence lizard, considering the considerable size difference between fence lizard chromosome 6 and the syntenic green anole chromosome 6.Similarly, tegu chromosomes 8 and 16 are syntenic to python chromosomes 10 and 16, fusing to form the fence lizard chromosome 9 but almost completely absent from the green anole assembly.These synteny results further support that the tenth largest chromosome in the SceUnd1.0 assembly is syntenic to the anole X chromosome (Fig 5).However, it is not syntenic to the python X chromosome, which is syntenic to the Z chromosome in other snakes.The tegu sex chromosome has not been identified.On the basis of the blast hits from the anole X-linked genes and this synteny analysis we define the tenth largest chromosome in the SceUnd1.0 assembly as the putative X chromosome, but functional data are needed to confirm this assignment.

Discussion
For the advancement of reptilian genomic and transcriptomic resources, we provide a high-quality, chromosome-level genome assembly for the eastern fence lizard, S. undulatus, de novo transcriptomes for S. undulatus encompassing multiple tissues and life stages, and improved draft genome assemblies from 34 additional Sceloporus species.In the final reference assembly, Sce-Und1.0,the largest 11 scaffolds contain 92.6% (1.765 of 1.905 Gb) of the genome sequence; these 11 scaffolds likely represent the 6 macro-and 5 microchromosomes of S. undulatus, based on karyotype, genome size, BUSCO analysis, and synteny with other squamate genomes.The remaining small scaffolds may contain some chromosome segments that could not be assembled, misassembled regions, or duplicated genes.
In comparing the 3 levels of reference genome assemblies, we found that the first level using only the 10X Genomics and the SuperNova Assembly contained all, or very nearly all, of the protein-coding regions of the genome within its contigs (based on BUSCO and mapping of RNAseq and whole-genome resequencing data).By including the Hi-C data, the contiguity of the HiRise Assembly dramatically improved, joining contigs into chromosome-length scaffolds, but had minimal effect on mapping percentages for either RNAseq or WGS.The inclu-sion of the PacBio data in the final PBJelly Assembly to produce SceUnd1.0closed some gaps but yielded a relatively small improvement after the already dramatic improvements from the Hi-C data.
While it is now becoming possible to obtain a reference genome assembly for almost any organism, the quality and cost of reference genome assemblies vary considerably depending on the technologies used.This presents researchers with an important question: what levels of sequencing effort and assembly quality are required for a particular ecological genomics study?Important factors that must be considered include the sequencing depth, sequence contiguity, and thoroughness of annotation.Our study demonstrates that the SuperNova Assembly was sufficient for mapping RNAseq and whole-genome resequencing, while the more expensive data from HiC and PacBio were necessary to achieve high-level continuity and chromosome-level scaffolding in the HiRise and PBJ Assemblies.
Genome assemblies of high quality and contiguity are critical for understanding organismal biology in a wide range of contexts that includes behavior, physiology, ecology, and evolution, on scales ranging from populations to higher-level clades.From RNAseq to ChIPseq (chromatin immunoprecipitation sequencing) and epigenetics, large-scale sequencing is rapidly becoming commonplace in ecological genomics to address fundamental questions of how organisms directly respond to their environment and how populations evolve in response to environmental variation.Many advanced molecular tools are typically reserved for traditional model organisms, but with the large foundation of ecological and physiological data available for S. undulatus, a high-quality reference genome opens the door for these molecular techniques to be used in this ecological model organism.For example, with the recent demonstration of CRISPR-Cas9 gene modification in a lizard, the brown anole [103], a genome reference will facilitate the application of gene drive technologies for functional genomic studies in Sceloporus lizards.This reference will provide a foundation for whole-genome studies to elucidate speciation and hybridization among closely related species utilizing low-coverage re-sequencing, or as a point of comparison with more distantly related species relative to the chromosomal inversions and large-scale genome architectural changes common in the clade.Sceloporus undulatus and other lizards in the genus Sceloporus exhibit evolutionary reversals in sexual size dimorphism and dichromatism and they have been used to demonstrate that androgens such as testosterone can inhibit growth in species (such as S. undulatus) in which females are the larger sex [19,[104][105][106].This SceUnd1.0 chromosomelevel genome assembly would support ChIPseq or in silico analyses to identify sex hormone response elements.In addition, this assembly will facilitate the identification of signatures of exposure to environmental stressors in both gene expression and epigenetic modification [107] to evaluate pressing questions on how climate change and invasive species affect local fauna.All of these uses for a chromosome-level genome assembly provide valuable extensions to ongoing work in the Sceloporus genus.

Figure 1 :
Figure 1: Adult male Sceloporus undulatus (eastern fence lizard) from Andalusia, AL, pictured outside of Samford Hall at Auburn University, (a) profile, (b) ventral, (c) dorsal view.This specimen was used for genome sequencing at DoveTail Genomics.Photo credit: R. Telemeco.

Figure 2 :
Figure 2: An evaluation of Sceloporus undulatus genome assembly quality.(a) Comparison of the contiguity of the 3 S. undulatus genome assemblies (fence lizard) relative to other squamate genome assemblies based on the log10 of the scaffold length.The X axis is the N(x) with the N50 and the N90 emphasized with a vertical line, representing the scaffold size that contains 50% or 90% of the data, respectively.The legend lists the assemblies in the order of the lines from most contiguous (top) to least contiguous (bottom).Note the Fence Lizard PBJelly (dark blue, SceUnd1.0) and Fence Lizard HiRise (light green) assemblies are the second and third from the top and are nearly indistinguishable.(b-d) Scaffold size distribution of SceUnd1.0 and the number of BUSCO genes that mapped to each scaffold.(b) The length of the first 24 scaffolds, where the first 11 scaffolds likely represent the haploid N = 11 chromosomes (6 macrochromosomes and 5 microchromosomes).The numbers above each bar represent scaffold length to the nearest Mb.The number of BUSCO genes that mapped to each scaffold based on (c) the genome assembly, and (d) the predicted proteins from the annotation.The 11 large scaffolds inferred to correspond to chromosomes have many unique and complete BUSCO genes (green), whereas the smaller contigs have duplicated BUSCOs (purple), suggesting that they are the result of reads not mapping correctly to the chromosomes.

Figure 3 :
Figure 3: Age distributions of the major repetitive elements found in the Anolis carolinensis (AnoCar2.0)and Sceloporus undulatus (SceUnd1.0)genome assemblies.The repeat landscapes depict the relative abundance of repeat types in the genome vs their Kimura divergence from their consensus.DNA: DNA transposons; LINE: long interspersed nuclear element; LTR: long terminal repeat retrotransposon; RC: rolling circle Helitron; SINE: short interspersed nuclear element.

Figure 4 :
Figure 4: Relationship between divergence time and effectiveness of using the Sceloporus undulatus assembly for reference-based mapping.(a) Phylogenetic relationships and divergence times of selected Sceloporus species, according to Leach é et al. [99].For the purpose of illustration only the species used in our analysis are shown.(b) Relationship between percent reads mapped to the S. undulatus reference genome (SceUnd1.0)and time of divergence from S. undulatus with a linear regression.The color of the dots represents the percent of the genome that is covered, which was affected by the number of redundant sequences in the reduced representation library for a particular species.

Figure 5 :
Figure 5: Marker-based synteny painting of fence lizard (Sceloporus undulatus) scaffolds/chromosomes onto the tegu (Salvator merianae), green anole (Anolis carolinensis), and python (Python bivittatus) assemblies.The color indicates synteny for that scaffold.The linkage groups representing microchromosomes in the green anole are lettered and expanded to visualize the colors.The white areas did not have a high-confidence match between the anole and the fence lizard to paint.Putative sex chromosomes are indicated with uppercase letters.

Table 1 :
Summary statistics across genome assemblies for Sceloporus undulatus

Table 2 :
Sceloporus undulatus de novo transcriptome assembly statistics [51]4 tissues comprise 3 tissues first reported in this study (brain, skeletal, and embryos) from gravid females collected in Edgefield County, SC, plus liver tissue previously reported by McGaugh et al. 2015[51].

Table 3 :
BUSCO results for transcriptomes of 2 lizard species [59]Sceloporus undulatus, the 4 tissues are the 3 tissues (brain, skeletal muscle, and embryos) first reported here with the addition of 1 tissue (liver) from McGaugh et al. 2015[51].For Anolis carolinensis, see Eckalbar et al. 2013[59]for the complete list of tissues used.

Table 4 .
Annotation of Sceloporus undulatus de novo transcriptome assembly using 4 tissues Parentheses indicate unique annotation numbers.

Table 6 .
Comparison of each genome assembly type as a reference for population-level analyses for RNAseq and WGS of Sceloporus undulatus individuals from Alabama (AL, either low or high coverage), Tennessee (TN), and Arkansas (AR)

Table 7 .
Sceloporus species with partial genomic sequence assemblies updated using SceUnd1.0 as a reference